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Isotonic regression is a nonparametric approach for fitting mono- 
tonic models to data that has been widely studied from both theo- 
retical and practical perspectives. However, this approach encounters 
computational and statistical overfitting issues in higher dimensions. 
To address both concerns, we present an algorithm, which we term 
Isotonic Recursive Partitioning (IRP), for isotonic regression based 
on recursively partitioning the covariate space through solution of 
progressively smaller "best cut" subproblems. This creates a regu- 
larized sequence of isotonic models of increasing model complexity 
that converges to the global isotonic regression solution. The models 
along the sequence are often more accurate than the unregularized 
isotonic regression model because of the complexity control they offer. 
We quantify this complexity control through estimation of degrees of 
freedom along the path. Success of the regularized models in predic- 
tion and IRPs favorable computational properties are demonstrated 
through a series of simulated and real data experiments. We discuss 
application of IRP to the problem of searching for gene-gene inter- 
actions and epistasis, and demonstrate it on data from genome-wide 
association studies of three common diseases. 



1. Introduction. In predictive modeling, we are given a set of n data 
observations (xi, yi), . . . , {xn,yn), where x £ X (usually X = W^) is a vector 
of covariates or independent variables, y S M is the response, and we wish 
to fit a model / : A" — ?■ M to describe the dependence of y on x. The most 
common approach traditionally used in statistics to accomplish this task is 
to seek a model that minimizes in-sample squared error (I2) loss, that is. 
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where J- is a class of candidate models. The use of squared error loss can 
be interpreted as maximum likelihood fitting under an additive Gaussian 
noise assumption, or, more generally, as trying to estimate E(y|a;) [Gneiting 
(2011)]. Other loss functions can be used to represent other estimation goals. 

Isotonic regression is a nonparametric modeling approach which restricts 
the fitted model to being monotone in all independent variables [Barlow and 
Brunk (1972)]. Define Q to be the family of isotonic functions, that is, g gQ 
satisfies 

xi^X2 g{xi) <g{x2), 

where the partial order ^ here will usually be the standard Euclidean one, 
that is, xi :< X2 if and only if xij < X2j coordinate-wise. Given these defini- 
tions, standard I2 isotonic regression solves 



We denote by / the optimal solution to (1). As many authors have noted, / 
comprises a partitioning of the space X into regions with no "holes" that 
satisfy isotonicity properties defined below, with a constant fitted to / in 
every region. 

In terms of model form, isotonic regression is clearly very attractive in 
situations where monotonicity is a reasonable assumption, but other com- 
mon assumptions like linearity or additivity are not. Indeed, this formulation 
has found useful applications in biology [Obozinski et al. (2008)], medicine 
[Schell and Singh (1997)], statistics [Barlow and Brunk (1972)] and psychol- 
ogy [Kruskal (1964)], among others. In recent years, an exciting new applica- 
tion area has emerged for this approach in genetics: modeling genetic inter- 
actions in heritability. Many papers have noted the apparent insufficiency of 
standard additive modeling approaches in describing the combined effects 
of genetic factors (e.g., mutations) on phenotypes like height and disease 
[Goldstein (2009), Eichler et al. (2010)]. Some findings in mice have pointed 
to subadditive interactions [Shao et al. (2008)], while others suggest requir- 
ing super-additive assumptions in order to explain heritability [Goldstein 
(2009)]. It is generally accepted, however, that while the effect of one ge- 
netic factor on a phenotype can be modulated, enhanced or even eliminated 
by other genetic factors, it is not expected to reverse direction [Mani et al. 
(2007), Roth, Lipshitz and Andrews (2009)]. In other words, the isotonic- 
ity assumption with respect to genetic effects is widely accepted, but the 
form of epistasis (genetic interaction) between factors is not clear and may 
vary between phenotypes. Other properties of this application domain also 
favor the use of isotonic regression as we discuss below. It should be noted, 
however, that most discussions of epistasis have been theoretical, with exten- 
sive systematic efforts at discovering actual epistatic combinations in human 
disease, resulting in very limited [Emily et al. (2009)] or no results [Cordell 
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(2009)]. These efforts have utihzed simple statistical approaches (chi-square 
tests, logistic regression) to search for low-dimensional interactions, mostly 
of two mutations at a time. The question of whether their lack of findings is 
due to limitations of methodology and concentration on low dimension, or 
lack of real epistatic signal, is a key one, and it can be addressed by isotonic 
modeling. 

Two major concerns arise when considering the practical use of isotonic 
regression in modern situations as the number of observations n, the data 
dimensionality d, and the number of isotonicity constraints m = 
Xi ^ Xj}\ implied by (1) all grow large: statistical overfitting and compu- 
tational difficulty. The notations n, d and m will refer to these quantities 
throughout the paper. 

The first concern is statistical difficulty and overfitting. Beyond very low 
dimensions, the isotonicity constraints on the family Q can become ineffi- 
cient in controlling model complexity and the isotonic regression solutions 
can be severely overfitted [e.g., see Bacchetti (1989) and Schell and Singh 
(1997)]. At the extreme, there may be no isotonicity constraints because 
no two observations obey the coordinate-wise requirement for the ^ order- 
ing. The isotonic solution in this case simply assigns /(xj) = yj, providing 
a perfect interpolation of the training data. As demonstrated in the litera- 
ture [Schell and Singh (1997)] and below, the overfitting concern is clearly 
well-founded when considering the optimal isotonic regression model implied 
by (1), even in nonextreme cases with a large number of constraints. In this 
case, regularization, that is, fitting isotonic models that are constrained to 
a restricted subset of Q, could offer an approach that maintains isotonicity 
while controlling variance, leading to improved accuracy. 

A second concern is computational difficulty. The discussion of isotonic re- 
gression originally focused on the case a; € M, where ^ denoted a complete or- 
der [Kruskal (1964)]. For this case, the well-known pooled adjacent violators 
algorithm (PAVA) efficiently solves (1) in computational complexity 0{n). 
Low complexities can also be found when the isotonic constraints take a spe- 
cial structure such as a tree [0(n log n) in Pardalos and Xue (1999)]. Various 
algorithms have been developed for the partially ordered case, including the 
classical approach of Dykstra and Robertson (1982) for data on a grid, gen- 
eralizations of PAVA [Lee (1983), Block, Qian and Sampson (1994)] and 
active set methods [de Leeuw, Hornik and Mair (2009)]. These approaches 
offer no polynomial complexity guarantees and are impractical when data 
sizes exceed a few thousand observations (in some cases much less). Inte- 
rior point methods offer complexity guarantees of 0(max(m,n)^) [Monteiro 
and Adler (1989)], however, they are impractical for large data sizes due to 
excessive memory requirements. A much more computationally attractive 
approach can be found in the optimization and operations research litera- 
ture. The basic idea of this approach is to repeatedly and "optimally" split 
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the covariate space X into regions of decreasing size by solving a sequence 
of specially structured best cut problems for which efficient algorithms ex- 
ist. At most n partitions are needed, leading to a computational complexity 
bounded by O(n^), and in some cases even less. From a practical perfor- 
mance perspective, this algorithm can obtain an exact solution of (1) for 
data sets with tens of thousands of observations in minutes. The first ap- 
pearance of this approach, to our knowledge, is in the work of Maxwell and 
Muckstadt (1985) [and similarly Roundy (1986)], who presented a model for 
finding reorder intervals in a production-distribution system. At the center 
of their problem was a regression subject to isotonicity constraints, but with 
a different loss function than that in problem (1), to which they provided an 
algorithm based on partitioning. Applicability of their algorithm with min- 
imal changes to problem (1) was more recently noticed by several authors 
[e.g.. Sponge, Wan and Wilbur (2003)], who used it to state a similar effi- 
cient partitioning scheme for isotonic regression. A similar, highly efficient, 
algorithm by Hochbaum and Queyranne (2003) also solves problem (1) un- 
der the additional constraint that fits take integer values. This partitioning 
approach does not appear to be well known in the statistics community, and, 
indeed, we have independently developed it before discovering it is already 
known. 

The literature cited above invariably refers to this iterative splitting algo- 
rithm merely as an approach for efficiently arriving at the optimal solution 
of (1). However, as noted before, this solution can be highly overfitted, es- 
pecially as the dimension d increases. Our main interest lies in analyzing 
the iterative approach as a means toward resolving the overfitting problem, 
as well as the computational issue. We propose to view this iterative algo- 
rithm as a recursive partitioning approach that generates isotonic models of 
increasing model complexity, ultimately leading to the solution of (1); the 
algorithm is termed Isotonic Recursive Partitioning (IRP). We prove that 
the models generated by the IRP iterations are indeed isotonic (Theorem 4) 
and consider them as a regularization path of increasingly complex isotonic 
regression models. Models along the path are less complex, and hence likely 
to be less overfit and offer better predictive performance than the overall 
solution to (1), while still maintaining isotonicity. This is confirmed by our 
analysis of the equivalent degrees of freedom along the IRP path, as well as 
experiments with simulated and real data. 

We observe that for very low dimension (typically d<2) the nonregular- 
ized solution of (1) performs well. As the dimension increases, regularization 
becomes necessary, and intermediate models on the IRP path perform bet- 
ter than the nonregularized solution. However, eventually overfitting plagues 
IRP from its first iteration, and the isotonic models fail to perform better 
than simple linear regression in out-of-sample prediction, even when the 
linear model is inappropriate. In our simulations, this occurs around dimen- 
sions 6-8 even for relatively large data sets. 
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Fig. 1. Illustration of IRP on Baseball data. Salary is modeled by number of runs batted 
in and hits. Models after iterations 1 and 10 of IRP and the final model are shown. 

Progress of IRP is illustrated in Figure 1, where we show an example 
of applying IRP to the well-known Baseball data set [He, Ng and Portnoy 
(1998)] describing the dependence of salary on a collection of player prop- 
erties. We limit the model to only two covariates to facilitate visualization, 
and we choose to use the number of runs batted in and hits since they 
seemed a priori most likely to comply with the isotonicity assumptions. The 
increasing model complexity can be seen, moving from iteration 1 (a sin- 
gle split) through 10 iterations of IRP, to the final isotonic model optimally 
solving (1), comprising a splitting of the covariate space into 29 regions, each 
of which is fitted with a constant. Note that the single split from iteration 1 
creates two flat surfaces (the highest and lowest in the figure), and that the 
in-between surfaces are interpolations in regions with no data points based 
on the two-surface model. The figures for models after 10 and 28 iterations 
are quite complex for the same reason — interpolations in the continuous 
covariate space. 

An obvious analogy of IRP can be made to well-known recursive partition- 
ing approaches for regression such as CART [Breiman et al. (1984)], where 
the iterative splitting of the covariate space generates a sequence of models 
(trees) of increasing model complexity, from which the "best" tree is chosen 
via cross-validation [e.g., using the 1-SE rule Breiman et al. (1984)]. As with 
CART and other similar approaches, IRP performs a greedy search and finds 
a "local" optimum in every iteration. However, unlike CART, which has no 
guarantees on the overall model it generates, IRP is proven to terminate in 
the global solution of the isotonic regression problem (1). Another difference 
is that IRP splits are not made along one axis at a time, but rather each 
split is a nonparametric division of one region in X into two subregions. 

Importantly, while our presentation has so far concentrated on isotonic 
regression with an I2 loss function, an interesting extension of the partition- 
ing scheme can be used to solve large-scale isotonic regressions under other 
loss functions that are of great interest in statistics, for example, logistic 
and poisson log-likelihoods. Specifically, a well-known result of Barlow and 
Brunk [(1972), Theorem 3.1], implies that, for a large class of loss functions 
(formally defined in our Discussion section), the solution of isotonic mode- 
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ling can be found by a simple transformation of the solution to isotonic 
regression with an I2 loss function. This will play an important role in our 
use of isotonic regression on modeling gene-gene interactions in epistasis. 
Indeed, there we will maximize logistic log- likelihood subject to isotonicity 
constraints. 

The remainder of this paper is organized as follows. We first present and 
analyze the IRP algorithm in Section 2. We detail the best cut problem 
solved for splitting at each iteration, and prove that this algorithm is a no- 
regret algorithm, in the sense that it only partitions the data and never 
merges back previously made partitions and converges to the global solu- 
tion of (1) (Theorem 2). Furthermore, we prove that the intermediate par- 
titions generated along the IRP path are also isotonic, in the sense that 
fitting each region to the average gives a model that is in the class G of iso- 
tonic functions in M*^ (Theorem 4). Section 3 briefly reviews the theoretical 
computational guarantees of IRP as reflected in the literature, and develops 
a simple and realistic case where the overall computation is O(n^). Sec- 
tion 4 discusses the statistical model complexity of models generated along 
the regularization path. Meyer and Woodroofe (2000) have shown that the 
number of partitions in the solution of (1) is an unbiased estimator of the 
(equivalent) degrees of freedom [as defined by Efron (1986)]. Since IRP adds 
one to the number of partitions at each iteration, the number of iterations 
may be used as a parametrization of this sequence. However, we argue that 
the number of regions is not a good estimate of degrees of freedom because 
IRP performs much more fitting in its initial iterations compared to later 
stages, and demonstrate this effect empirically through simulation. We also 
show that when the covariates are ternary, that is, covariate Xij € {0,1,2} 
(as is natural in our motivating genetic example when dealing with ternary 
genotype data), the overall number of degrees of freedom and model com- 
plexity increase more slowly with dimension, compared to general contin- 
uous covariates, resulting in much less overall fitting for each dimension. 
Section 5 examines IRPs statistical and computational performance on sim- 
ulated and real data, specifically pointing out the effect of regularization 
and increased dimensionality on predictive performance. We apply IRP to 
simulations with ternary covariates and sub- and super-additive interactions 
motivated by the genetic application and demonstrate its favorable perfor- 
mance. Section 6 applies IRP to real-life data sets from the Wellcome Trust 
Case Control Consortium (WTCCC) and discusses the results [WTCCC 
(2007)]. Section 7 concludes with extensions and connections to previous lit- 
erature. A Matlab-based software package implementing the IRP algorithm 
is available at http://www.tau.ac.il/~saharon/files/IRPvl.zip. 

We next define terminology to be used throughout the paper. 

1.1. Definitions. Let V = {xi, . . . ,Xn} be the covariate vectors for n 
training points where Xi G M'^ and denote T/j G M as the ith. observed re- 
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sponse. We will refer to a general subset of points AQV with no holes (i.e., 
X ^ z and x,z ^ y A) as a group. Denote by \A\ the cardinality 
of group A. The weight of group A is defined as y^^ = |^ Xlj ■ x^gA -'^^^ 
two groups A and B, we denote A^ B if there exists x € A,y B such that 
X ^y and there does not exist x (z A,y B such that y ~< x (i.e., there is at 
least one comparable pair of points that satisfy the direction of isotonicity). 
A set of groups V is called isotonic if A ^ B ^yj^ <y^ for all A,BgV. 
The groups within this set V are referred to as isotonic regions. A sub- 
set C iU) of j4 is a lower set {upper set) of A x G A, y G C, x ^ y ^ x G C 
{x eU,y e A,x ^y ^y eU). 

A group C ^ is defined as a block of group A if yur\B — Vb each 
upper set U of A such that U H B ^ {■} (or, equivalently, if yc^B ^ Vb 
each lower set Cof A such that >Cni? 7^ {•}). A set of blocks 5 = {Bi, . . . ,Bf^} 
is called a block class of V if BiO Bj = {•} and BiU ■ ■ ■ L) B^ = V . 5 is an 
isotonic block class if for all Bi,Bj G 5, Bi ^ Bj =^yBi ^Vsj- group X 
majorizes (minorizes) another group Y if X {X ^Y). A group X is 
a majorant {minorant) of X U ^ where A = IJi=i Ai if X Ai {X )/- Ai) for 
all i = 1, . . . , fc. 

We denote the optimal solution for maximizing a general function f{z) in 
the variable zhy z* , that is, z* = argmax/(2;). 

2. IRP and a regularization path for isotonic regression. We describe 
here the partitioning algorithm used to solve the isotonic regression prob- 
lem (1). Section 2.1 first reformulates the isotonic regression problem and 
describes the structure of the optimal solution. Section 2.2 motivates and de- 
tails the IRP algorithm and, in particular, the main partitioning step. Each 
group created by the partitioning scheme is proven to be the union of blocks 
in the optimal solution, that is, all partitions have the no-regret property. 
An important aspect of the algorithm is the regularization path generated 
as a byproduct as each partition creates a new feasible solution. Section 2.3 
goes on to prove convergence of IRP to the global optimal solution of (1), 
and most importantly, that each solution along the regularization path is 
isotonic. 

2.1. Structure of the isotonic solution. Isotonic regression seeks a mono- 
tonic function that fits a given training data set {(a^i, 2/i)}iLi ^-rid satisfies 
a set of isotonicity constraints which we index by the set X = {(i, j) :xi<Xj}. 
We will usually assume that Xj G M*^ and that < is the standard partial order 
in based on coordinate- wise inequalities. A reformulation of (1) is 

(2) min|^(yj -yif:yi <yj for all gx|. 

Problem (2) is a quadratic program with linear constraints. Any solution 
satisfying the constraints given by I is referred to as an isotonic, or fea- 
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sible, solution. The structure of the optimal solution to (2) is well known. 
Observations are divided into k groups where the fits in each group take 
the group mean observation value. This can be seen through the follow- 
ing Karush~Kuhn-Tucker (KKT), that is, optimality, conditions [Boyd and 
Vandenberghe (2004)] to (2): 

(a) yi=yi - (i,j)&X ^ij (j,i)&X -^J*) ' 

(b) Vi < Vj for ah gX, 

(c) Xij>0 for ah 

(d) Xijivi -yj)=0 for ah (i, j) G I, 

where Xij is the dual variable corresponding to the isotonicity constraint 
Vi < Vj- This set of conditions exposes the nature of the optimal solution. 
Condition (d) implies that Xij > ^ yi = yj , meaning Xij can be nonzero 
only within blocks in the isotonic solution which have the same fitted value. 
For observations in different blocks, Xij = 0. Furthermore, the fit within 
each block is trivially seen to be the average of the observations in the 
block, as the average minimizes the block's squared loss. A block is thus 
also referred to as an optimal group with respect to an isotonic regression 
problem. Condition (b) implies isotonicity of the blocks, and thus, we get the 
familiar characterization of the isotonic regression problem as one of finding 
a division into an isotonic block class. 

2.2. The partitioning algorithm. Suppose a current group V is optimal 
(i.e., F is a block) and, thus, the optimal fits at points in V, denoted sat- 
isfy y* = Jjy for all i which leads to the condition J2ieV ~ ^v) — 0- 
Then finding two groups A and B within V such that 2iGB iUi ~yv) ~ 
X^ieA iVi ~yA) > should be infeasible, according to the KKT conditions. 
The division in IRP looks for two such groups. Denote by Cy = {{A,B) : 
A,B <^V,A\J B = V,Alr^ B = {•}, and there does not exist x € A,y € B 
such that y ^ x} the set of all feasible (i.e., isotonic) partitions defined 
by observations in V. We refer to partitioning as making a cut through 
the variable space (hence, our optimal partition is made by an optimal 
cut). The optimal cut is determined by the partition that solves the prob- 
lem 

^^max ^ 9* (A, B) = ^^max ^ { (,, -yy)-Y^y^-yv) 
(3) 

= -\M{vA-yv) + \B\{yB-yv)^ 

where A{B) is the group on the lower (upper) side of the edges of the cut. 
A more statistically intuitive rule might look for the split that maximizes 
between-group variance. This partitioning problem solves 

..^f^. ^(^'^) = r..^f^. A\^\^yA-yvf + \B\{yB-yvf}■ 
{A,B)£Cv {{A,B)&Cv} 
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The next proposition makes a connection between the above two maxi- 
mization problems, and draws a clear conclusion on the relationship between 
their optimal solutions, namely, that the optimal partitions to (3) are always 
more balanced than the optimal partitions to (4). 

Proposition 1. Denote the optimal solutions of the optimal cut prob- 
lem (3) and the between-group variance maximization problem (4) by (A* ,B*) 
and {A,B) and their objective functions by g*{A,B) and g{A,B), respec- 
tively. Then 

{A*,B*) = argmax {\A\\B\g{A,B)} 

{A,B)eCv 

and 

{\A*\-\B*\f <{\A\-\B\f. 
We leave the proof to the Appendix. 

Thus, we can look at the IRP criterion as a modified form of maximizing 
between-group variance which encourages more balanced splitting. How- 
ever, while solving the partition problem (4) is difficult, the IRP partition 
problem (3) is tractable. Indeed, the optimal partition problem (3) can be 
reduced to solving the linear program 

(5) max{c"^z : Zi < zj for all {i,j) € X, —1 < Zi <1 for all i S V}, 

where Ci = yi — yy. If the optimal objective value equals zero, then the 
group V must be an optimal block. 

This group-wise partitioning operation is the basis for our IRP algorithm 
which is detailed in Algorithm 1.^ It starts with all observations as one group 
and recursively splits each group optimally by solving subproblem (5). At 
each iteration, a list C of potential optimal partitions for each group gener- 
ated thus far is maintained, and the partition among them with the highest 
objective value is performed. The list C is updated with the optimal parti- 
tions generated from both subgroups. Partitioning ends whenever the solu- 
tion to (5) is trivial (i.e., no split is found because the group is a block). We 
can think of each iteration k of Algorithm 1 as producing a model by 
fitting the average to each group in its current partition: for a set of groups 
V = {Vi,..., Vk}, denote = {yv^,- ■ ■ ,yvj- Then model A4 = (V,yv) con- 
tains the partitioning V as well as a fit to each of the observations, which is 
the mean observation of the group it belongs to in the partition. 

2.3. Properties of the partitioning algorithm. Theorem 2 next states the 
result which implies that the IRP partitions are no-regret. This will lead to 
our convergence result. 



^This algorithm appeared in a shorter version of the paper [Luss, Rosset and Shahar 
(2010)]. 
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Algorithm 1 Isotonic Recursive Partitioning 
Require: Observations {xi,yi), . . . , {xn,yn) and partial order I. 
Require: A: = 0,^ = C = {(0, {xi, ... {•})}, B = {■}, 

Mo = {A,y^). 
1: while A ^ {•} do 

2: Let {val,w~ ,w~^) G C be the potential partition with largest val. 
3: Update A = {A\{w- Uw+)) U {w-,w+}, C = C\{val,w- ,w+). 
4: Mk = iA,y^).^ 
5: for all V e {w~ ,w^} \ {■} do 

6: Set Ci = yi — y^ for all i such that rcj € u where y^ is the mean of 

the observations in set v. 
7: Solve LP (5) with input c and get z* = argmaxLP(5). 
8: if z| = • • • = (group is optimally divided) then 
9: Update ^ = ^\t; and S = SU{i;}. 

10: else 

11: Let v~ = {xi : z* = = {xi : z* = +1}. 

12: Update C = CU{(c^z*, i;-, -{;+)} 

13: end if 

14: end for 

15: k = k + l. 

16: end while 

17: return B, a partitioning of observations corresponding to the optimal 
groups. 



Theorem 2. Assume group V is the union of blocks from the opti- 
mal solution to problem (2). Then a cut made by solving (3) [using (5)] at 
a particular iteration does not cut through any block in the global optimal 
solution. 

The fact that IRP is a no-regret algorithm can be shown using a connec- 
tion between the work of Barlow and Brunk (1972) and Maxwell and Muck- 
stadt (1985) (held to the Discussion in Section 7). We prove Theorem 2 
directly, but leave it to the Appendix, as the theorem is already known 
to be true [Spouge, Wan and Wilbur (2003)]. Remark 7 in the Appendix 
handles the case for multiple observations. Since Algorithm 1 starts with 
A = {{xi, . . . which is a set of one group that is the union of all 

blocks, we can conclude from this theorem that IRP never cuts an optimal 
block when generating partitions. The following corollary is then a direct 
consequence of repeatedly applying Theorem 2 in Algorithm 1: 

Corollary 3. Algorithm 1 converges to the optimal (isotonic block 
class) solution with no regret. 
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Fig. 2. Illustration of the proof of Theorem 4 showing the defined sets at iteration k + 1. 
G is the set divided at iteration k + 1 into A (all blue area) and B (all green area). The 
group bordering A from below denoted by Xi (also referred to as D in the proof) is in 
violation with A. At iteration ko, G is part of the larger group H and X\ is part of the 
larger group F. At iteration ko, groups F and H are separated. The proof shows that 
when A and B are split at iteration k + 1, no group such as Xi where wxi > voa could 
have existed. In the picture, Xi must have been separated at an iteration fco < fc + 1, but 
the proof, through contradiction, shows that this cannot occur. 

Theorem 4 next states our main innovative result that Algorithm 1 pro- 
vides isotonic solutions at each iteration. This result implies that the path 
of solutions generated by IRP can be regarded as a regularization path for 
isotonic regression. Along the path, the model grows in complexity until op- 
timality. These suboptimal isotonic models often result in better predictive 
performance than the optimal solution, which is susceptible to overfitting as 
is discussed in Section 5. 

Theorem 4. Model generated after iteration k of Algorithm 1 is in 
the class Q of isotonic models. 

Proof. The proof is by induction. The base case, that is, first iteration, 
where all points form one group is trivial. The first cut is made by solving 
the linear program (5) which constrains the solution to maintain isotonicity. 

Assuming that iteration k (and all previous iterations) provides an iso- 
tonic solution, we prove that iteration k + 1 must also maintain isotonicity. 
Figure 2 helps illustrate the situation described here. Let G be the group 
split at iteration k + 1 and denote A (B) as the group under (over) the cut. 
Let A = {X : X is a group at iteration k + 1, there exists Xi € X such that 
{i,j) € I for some xj G A} (i.e., X A border A from below). 

Consider iteration A; + 1. Denote X = {X ^ A'.yj^ < Vx} (i-6-, X X 
violates isotonicity with A). The split in G causes the fit in nodes in A 
to decrease. We will prove that when the fits in A decrease, there can be 
no groups below A that become violated by the new fits to A, that is, the 
decreased fits in A cannot be such that X ^ {•}. 



12 



R. LUSS, S. ROSSET AND M. SHAHAR 



We first prove that X = {■} by contradiction. Assume X ^ {•}. Denote 
i < A; + 1 as the iteration at which the last of the groups in X, denoted D, was 
spht from G and suppose at iteration i, G was part of a larger group H and D 
was part of a larger group F. It is important to note that X n (F U H) = {•} 
for six X ^ X\D aX iteration i because by assumption all groups m. X\D 
were separated from A before iteration i. Thus, at iteration i, D is the only 
group bordering A that violates isotonicity. 

Let Du denote the union of D and all groups in F that majorize D. By 
construction, Djj is a majorant in F. Hence, < Vfuh t>y Algorithm 1 
and < y^^^ by definition since y^j^ > y^, > yj^. Also by construction, any 
set X £ H that minorizes A has Ijx < IJa (each set X that minorizes A 
besides D such that Ijx < Va already been split from A). Hence, we can 
denote Al as the union of A and all groups in H that minorize A and we 
have Ijy^ > yAi^ s-'^d Al \s a, minor ant in H. Since Al Q H at iteration i, we 
have 

Vfuh < VAl <yA< VDu < Vfuh, 

which is a contradiction, and hence the assumption 7^ {•} is false. The first 
inequality is because the algorithm left Al in H when F was split from H, 
and the remaining inequalities are due to the above discussion. Hence, the 
split at iterations k + 1 could not have caused a break in isotonicity. 

A similar argument can be made to show that the increased fit for nodes 
in B does not cause any isotonic violation. The proof is hence completed by 
induction. □ 

With Theorem 4, the machinery for generating a regularization path is 
complete. In Section 3 we describe the computational complexity for gener- 
ating this path, followed by a discussion of the statistical complexity of the 
solutions along the path in Section 4. 

3. Complexity. We here show that the partitioning step in IRP can be 
solved efhciently. The computational bottleneck of Algorithm 1 is solving 
the linear program (5) that iteratively partitions each group. The linear 
program (5) has a special structure that can be taken advantage of in order 
to solve larger problems faster. Indeed, the dual problem can be written as 
an optimization problem called a network flow problem that is amenable 
to very efficient algorithms, as noted by Sponge, Wan and Wilbur (2003) 
who recognize the network flow problem as the maximal upper set prob- 
lem. We note that our partition problem (5) is very similar to the network 
flow problem solved in Chandrasekaran et al. (2005) where Zi there repre- 
sents the classification performance on node i. We denote the complexity 
of solving the linear program (5) by T{m,n), where m is the number of 
constraints defined by I and n is the number of observations. Various ef- 
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ficient algorithms for solving this problem exist, giving complexities such 
as T(m,n) = 0(mn log n) [Sleator and Tarjan (1983)] along with several al- 
gorithms giving T{m,n) = O(n^) [Galil and Naamad (1980)]. Choosing the 
more efficient implementation depends on the number of isotonicity con- 
straints m [e.g., < mnlogn for the worst case m = 0(n^)]. A recent re- 
sult by Stout (2010) shows how to represent an isotonic regression problem 
by an equivalent problem where both the number of total observations and 
constraints are of order 0(n log*^"^ n), which greatly reduces the worst case 
of m = O(n^) isotonicity constraints (i.e., by trading off a few additional 
shadow observations for a large reduction in the number of constraints). 
Since at most n partitions are made by IRP, complexity is O(n^) using 
T{m,n) = 0{n^) or reduced to 0(n^ log^'^""'^ n) using the results of Stout 
(2010). 

In practice, the complexity can be even better by accounting for the fact 
that IRP solves a sequence of partitioning problems that are decreasing 
in size (i.e., problems with fewer and fewer observations). Each partition in 
Algorithm 1 can be divided into different proportions. We generically denote 
the bigger proportion in a partition by p > 0.5. Proposition 5 next gives 
a bound on the complexity of Algorithm 1 for this general case [assuming 
T(m,n) = 0{n^)], in terms of the maximal p over all partitions. 

Proposition 5. Let Pmax ^ 0.5 be the greatest p over all iterations of 
Algorithm 1 such that iteration k partitions a group of size n/j into two groups 
ofsizeprik and {l—p)n}^. Denote byn the total number of observations. Then 
the complexity of Algorithm 1 is bounded by 

(6) 0(n3^ ^ 



1 — P- 



max 



The proof, given in the Appendix, is based on the fact that the sequence 
of IRPs partition problems are solved on smaller and smaller groups of 
observations [i.e., while the ffi'st partition problem is O(n^), the partition 
problems for the two created partitions are 0{p'^n^) and 0((1 — p)^n^) for 
some p where < p < 1]. Even at Pmax = 0.99, the constant 1/(1 — Pmax) ^ 50, 
which is very small when the number of observations is large. Thus, under 
another reasonable assumption that Pmax is bounded, we can conclude that 
IRP is of practical complexity 0(n^)/(l — Pmax)- Similar analysis using the 
results of Stout (2010) lead to a practical complexity of 0(n^log^'^~^ n)/ 

(1 -Pmax)- 

Additional enhancements to the algorithm can be made in order to solve 
even larger problems than done here. As noted above, the dual problem to 
our optimal partition problem is a specially structured network flow prob- 
lem. Luss, Rosset and Shahar (2010) show that the network flow problem 
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[i.e., the dual to problem (5)] can be decomposed and solved through a se- 
quence of smaller network flow problems. This reduction of the optimal 
partition problem makes IRP a practical tool for even larger data sets than 
experimented with here; refer to Luss, Rosset and Shahar (2010) for details 
on the large-scale decomposition which is beyond the scope of this paper. 

4. Degrees of freedom of isotonic regression and IRP. The concept of 
degrees of freedom is commonly used in statistics to measure the complexity 
of a model (or more accurately, a modeling approach) . This concept captures 
the amount of fitting the model performs, as expressed by the optimism of 
the in-sample error estimates, compared to out-of-sample predictive perfor- 
mance. Here we briefly review the main ideas of this general approach, and 
then apply them to isotonic regression and IRP. 

Following Efron (1986) and Hastie, Tibshirani and Friedman (2001), as- 
sume the values xi, . . . , a;„ € M'^ are fixed in advance (the fixed-x assump- 
tion), and that the model gets one vector of observations y = {yi, ■ ■ ■ , yn)"*" Gl^" 
for training, drawn according to P{y\x) at the n data points. Denote by y"^™ 
another independent vector drawn according to the same distribution, y is 
used for training a model /(x) and generating predictions yi = f{xi) at the n 
data points. 

We define the in-sample mean squared error, 

MRSS = -||y-y||i, 
n 

and compare it to the expected error the same model incurs on the new, inde- 
pendent copy, denoted in Hastie, Tibshirani and Friedman (2001) by ERRin, 

ERRin = — Eynew ||y"™ — ylli- 
n 

The difference between the two is the optimism of the in-sample prediction. 
As Efron (1986) and others have shown, the expected optimism in MRSS is 

(7) £;y,yncw (ERRin - MRSS) = ^ J^cov(2/,,yO. 

i 

For linear regression with homoskedastic errors with variance a^, it is easy 
to show that (7) is equal to f rfo"^, where d is the number of regressors, 
hence the degrees of freedom. This naturally leads to defining the equivalent 
degrees of freedom of a modeling approach as 

(8) d/ = ^cov(yi,yj)/cr^. 

i 

In nonparametric models, one usually cannot calculate the actual degrees 
of freedom of a modeling approach, but it is often easier to generate unbiased 
estimates df of df using Stein's lemma [Stein (1981)]. Meyer and Woodroofe 
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(2000) demonstrate the applicability of this theory in shape-restricted non- 
parametric regression. Specifically, their Proposition 2, adapted to our nota- 
tion, implies that if we assume the homoskedastic case var(yj) = for all i, 
then the unbiased estimator df for degrees of freedom in isotonic regression 
is the expected number of pieces D in the solution y to (2), that is, 

i 

Considering the IRP algorithm, this puts us in the interesting situation 
where the number of steps the algorithm takes until it terminates in the 
globally optimal isotonic solution is equal to the degrees of freedom esti- 
mator of this global solution (minus one, since we start with one piece). 
One might thus be inclined to assume that each iteration of Algorithm 1 
adds about one degree of freedom, that is, performs approximately the same 
amount of fitting in every iteration. A similar idea is represented by the 
degrees of freedom calculation of Schell and Singh (1997) in their reduced 
monotonic regression algorithm (which starts from the complete isotonic fit 
and eliminates pieces). 

On more careful consideration, however, it is obvious that this idea is 
incorrect since the first iteration of IRP finds an optimal cut in the very 
large space of all possible multivariate isotonic cuts. For comparison, a single 
deep split in a regression tree has been estimated to consume three or more 
degrees of freedom [Ye (1998)], and the space of possible splits in initial 
IRP iterations is much larger than that of a regression tree since IRP splits 
are not limited to being axis-oriented. Thus, intuitively, the first iteration is 
expected to use much more than one degree of freedom (the equivalent of 
fitting one coefficient to a fixed, pre- determined regressor). This effect should 
be exacerbated as the dimension d x increases since the size of the search 
space for isotonic cuts increases with it. It also inevitably implies that the 
latter iterations of the IRP algorithm should perform less (ultimately much 
less) fitting than the equivalent of one degree of freedom in every iteration 
in order to be consistent with the unbiasedness oi df = D as an estimator 
oidf. 

Here we demonstrate empirically that this is indeed the case. We simulate 
data from a simple additive model 

(9) Xij~^[l,2] i.i.d., 

(10) 2/, = ^x2.+AA(0,10), 

3 

where Xij is dimension j of the observation i. We can generate one fixed 
copy of data using (9), and then repeatedly generate observations using (10), 
apply IRP, and empirically estimate df as defined by (8) for every iteration of 
IRP. Figure 3(left) shows how df evolves in this model as the IRP iterations 
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1000 Training Samples 1000 Training Samples 




Length of Path Length of Path 



Fig. 3. Evolution of degrees of freedom for IRP as model complexity increases. Both 
models use yi = J2jXij + Af (0,10). Simulation (left) uses Xij '^ZY[1,2] and simulation 
(right) uses Xij € {0, 1, 2} with probabilities {1/3, 1/3, 1/3}. Each path is the mean over 50 
trials with 1,000 training samples. 

proceed, for increasing dimensions of x. The covariance in (8) was estimated 
by drawing values X = (xi, . . . , j;i^ooo) according to the model (9), fixing 
them, and repeatedly drawing 1,000 independent copies of y|X according 
to (10), and applying IRP on each one. 200 simulations were run with one 
drawing of X and the results were averaged. As expected, we see that the 
number of pieces (hence degrees of freedom) in the final isotonic regression 
increases with the dimension, as does the rate in which the number of degrees 
of freedom increases in the initial steps of IRP. 

In order to emphasize this dependence of the degrees of freedom in initial 
iterations on the dimension, as well as on the number of observations, Fig- 
ure 4 presents the evolution of the percentage of the total isotonic regression 
degrees of freedom along the path (i.e., number of degrees of freedom rel- 
ative to the number of partitions of the final model) as a function of both 
the dimension and the amount of data used. As expected, increasing the 
dimension radically increases the portion of the fitting in the first steps, 
while increasing the amount of data decreases this portion (since the overall 
isotonic fit is generally more complex in these situations). It should be noted 
that for many of the situations examined, IRP performs more than half of 
the total isotonic fit, as measured by degrees of freedom, in its first iteration! 
In dimension 7, even at n = 20,000 observations, almost 60% of the total fit- 
ting is associated with the first iteration. Thus, these simulations clearly 
demonstrate the nature and limitations of IRPs regularization behavior: the 
IRP path contains models that are regularized isotonic models compared to 
the global solution, but IRPs ability to control model complexity is limited 
by the concentration of most of the fitting in the initial iterations, especially 
in higher dimension. 
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Fig. 4. Percentage of degrees of freedom relative to the full path (i.e., number of partitions 
in the optimal solution) as model complexity increases. Simulations use iHij ^U[l,2] with 
fji = Xij + A/'(0, 10). Each path is the mean over 200 trials. Only the first 500 partitions 
of the paths are displayed in order to make the MSB of the earlier IRP iterations visually 
clearer. 



As mentioned in the Introduction, an area that combines applications 
where the isotonic assumptions are reasonable (i.e., low bias) and the over- 
fitting may be of less concern (i.e., variance can be controlled) is in genetics, 
specifically in modeling gene-gene interactions in phenotype(y)-genotype(X) 
relationships [Cordell (2009)]. The key observation here is that genotypes are 
ternary (xij € {0,1,2} copies of the "risk" allele). Thus, each dimension of 
the predictor space X can take only one of three possible observed values 
in the data. Intuitively, it is clear that this would significantly reduce the 
space of possible isotonic splits in IRP, and hence reduce the amount of 
fitting. To demonstrate this empirically. Figure 3(right) displays an exper- 
iment with the same setup, where instead of drawing the x values from 
a multivariate uniform, they are drawn independently from {0,1,2} with 
equal probabilities and we use the same model. Figure 3 demonstrates that 
both the globally optimal isotonic regression and, especially, the first IRP 
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iterations perform much less overall fitting in the ternary case versus the 
continuous measured by equivalent degrees of freedom. For exam- 

ple, in six dimensions, the continuous case requires almost seven times as 
many degrees of freedom than in the ternary case to fit the model. However, 
relative to the final model, a large percentage of the fitting still takes place 
in the initial iterations. 

5. Performance evaluation. We demonstrate here the usefulness of iso- 
tonic regression on simulation and real data. For each experiment, IRP is run 
on the training data and produces a path of isotonic models. Each model 
is used for prediction on the test data and the root mean squared error 
(RMSE) is recorded. This generates paths of root mean squared errors over 
the different isotonic models and is illustrated in the figures below. In each 
table, we record the minimum RMSE along these paths (IRP min RMSE), 
along with how many partitions were made to generate this minimum RMSE 
(IRP min path), and the number of partitions in the global isotonic solu- 
tion (IRP path length). IRP, as well as optimal isotonic regression, results 
are compared to running a least squares regression on the training data 
and predicting on the testing data with the resulting linear model (corre- 
sponding RMSE is called LS RMSE), and to the performance of the global 
isotonic regression solution {Isotonic regression RMSE). Because we are in- 
terested in examining the behavior of the entire IRP path for use in selecting 
the optimal tuning parameter, and to avoid a significant increase in running 
time, we do not employ cross-validation for selecting the best stopping point 
(number of iterations), but use test sets for this. In practical application, 
cross-validation would be the appropriate approach for selecting the best 
model for prediction. 

5.1. Simulations. We first illustrate isotonic regression on simulated data 
with different distributions. For the first three experiments, the ith obser- 
vation's regressors are distributed as Xij ~^[0,5], Xij ~Z//[0,3], and Xij ~ 
W[0,2], respectively, and in all cases, all Xij cirG 1.1. d. Responses yi for the 
three simulations are generated as 

(1) y,= (^x|)+A/-(0,4d2), (2) y,= (^llx,j^+M{0,d^) 
and 

(3) y, = 2^.^-+A/-(0,d2), 

respectively, where d is the dimension. The first simulation represents an ad- 
ditive (nonlinear) model and the other two simulations are "super-additive" 
models (i.e., represent strong positive interactions which are hard to approx- 
imate with additive effects). This allows us to examine the performance of 
IRP and isotonic regression in a variety of relevant situations. 
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The last two experiments are ternary. The ith observation's regressors are 
distributed as Xij G {0,1,2} with probabihties {0.7,0.2,0.1} and {1/3,1/3, 
1/3} for the fourth and fifth experiments, respectively. The fourth model is 
subadditive while the fifth model is super additive; specifically, they are 



(4) yi=[Y,''ij) +AA(0,dVlO) and (5) Vi = (HxiA +M{0,(f). 



Model 4 is chosen to simulate "subadditive" genetic interactions as dis- 
covered by Shao et al. (2008), where having one risk variant is sufficient 
to attain most of the effect, and additional variants have little additional 
influence. Model 5 represents the "superadditive" model, where variants ex- 
acerbate each other's effect, as often speculated to be the case in human 
disease. Note that, for each of 50 simulations, 12,000 training and 3,000 
testing points were randomly generated and statistics computed (all tests 
are out-of-sample) . 

Figure 5 demonstrates testing error for IRP over the regularized path of 
isotonic solutions for the first three experiments (with continuous covari- 
ates). The main observation here is that as the dimension increases, the 
effect of overfitting of the standard (nonregularized) isotonic regression be- 
comes more significant and causes the skewed [/-shaped pattern across the 
IRP path, where the minimum prediction RMSE is obtained earlier in the 
path. This is the effect we alluded to in the Introduction and it stems from 
the limitations of the isotonicity constraints in controlling model complexity 
in high dimensions. 

Table 1 displays certain statistics on all five simulations as well as a com- 
parison to the results of a least squares regression. We first discuss the case 
of continuous covariates (first three models). In lower dimensions standard 
isotonic regression performs well, and regularization through IRP offers no 
gain (this is seen in dimension d = 2 in all three examples). Here, isotonic 
regression controls bias by accommodating the nonlinearities in the true 
model and significantly outperforms least squares regression. As the num- 
ber of covariates increases, regularization through IRP becomes necessary to 
control variance, and the optimal performance is obtained earlier in the IRP 
path (dimension d = 4 in our examples). When d increases further, however, 
IRP also becomes inefficient at controlling variance, and linear regression 
dominates. This effect can be traced back to the large amount of fitting 
performed by IRP already in its initial iterations, as demonstrated in the 
previous section. 

With respect to the models with ternary covariates, isotonic regression 
outperforms the simple linear regression, however, the IRP path does not 
statistically improve performance. For the subadditive model (model 4), 
performance is better for dimensions 2 and 4, after which again IRP is 
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Fig. 5. Root mean squared error (RMSE) for out-of-sample predictions of simulations 
with different dimensions d. The x-axis m each figure corresponds to the number of parti- 
tions made by IRP, that is, the curves show how the RMSE of test data varies as the IRP al- 
gorithm progresses. Model 1 uses the function yt — (^^. Xij) + A/'(0, 4d^) with Xij ~ W[0, 5] . 
Model 2 uses the function yt = (11^2;;^) -\- JV{0,d^) with Xij ~ W[0,3]. Model 3 uses the 
function yi = 2^i^'^ -\- J\f{0,d'^) with Xij ~W[0,2]. Fifty simulations were run with 12,000 
training and 3, 000 testing points. Only the first 750 partitions of the paths are displayed 
in order to make the RMSE of the earlier IRP iterations visually clearer. 



unsuccessful at controlling variance. However, in the superadditive model 
(model 5), IRP dominates for all dimensions. This is not surprising, since 
superadditive models are more extreme in their deviations from additivity, 
making the flexibility of IRP more critical. 

Thus, our simulations confirm that isotonic regression performs well in 
low dimensions, but requires a lot of data in order to learn good nonlin- 
ear models in higher dimensions. In intermediate dimensions, IRP can offer 
a compromise between fitting flexible isotonic models and controlling model 
complexity, resulting in useful prediction models. 

5.2. Modeling MPG of automobiles. We next illustrate the performance 
of IRP when predicting the miles-per-gallon of a list of 392 automobiles man- 
ufactured between 1970 and 1982 using seven variables [Frank and Asuncion 
(2010)]. Seven regressions are performed in dimensions one through seven. 
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Table 1 

Statistics for simulations generated by the three different models as labeled above. A path 
of root mean squared errors (RMSE) for each model along the regulanzation path was 
computed. IRP mm RMSE refers to the minimum RMSE along these paths. Values for 
RMSE are given along with a conservative 95% confidence interval. LS RMSE is the 
RMSE from using least squares regressions. IRP min path is the number of partitions 
made to generate the minimum RMSE and IRP path length is the number of partitions in 
the global isotonic solution. Bolded RMSE values for IRP and isotonic regression indicate 
that they are lower than the RMSE of the least squares regression with 95% confidence 

Number of IRP min Isotonic regression LS IRP min IRP path 

variables RMSE RMSE RMSE path length 

Model 1: y, = {Y.j x%)+N{OAd^) with a;^ ~ W[0,5] 

2 4.06 (±0.02) 4.06 (±0.02) 4.80 (±0.01) 625 999 

4 8.64 (±0.03) 8.67 (±0.04) 8.83 (±0.03) 28 4,861 

6 14.27 (±0.06) 14.64 (±0.07) 12.87 (±0.04) 8 8,520 

8 24.41 (±0.11) 25.61 (±0.12) 16.89 (±0.05) 4 10,604 

Model 2: y, = {Yl^x.j) +U{0,d'') with a;^ ~W[0,3] 

2 2.01 (±0.01) 2.01 (±0.01) 2.13 (±0.01) 44 437 

4 4.67 (±0.03) 4.73 (±0.04) 6.07 (±0.03) 18 3,711 

6 19.79 (±0.24) 29.43 (±0.97) 19.62 (±0.29) 3 7,685 

8 76.23 (±2.08) 197.71 (±16.49) 65.40 (±2.24) 2 10,153 

Model 3: y^ = 2^3 "^'J +U{Q,d'^) with Xij ~W[0,2] 

2 2.02 (±0.01) 2.02 (±0.01) 2.17 (±0.01) 62 628 

4 6.67 (±0.08) 7.06 (±0.14) 10.10 (±0.09) 19 7,558 

6 88.59 (±1.13) 129.91 (±4.13) 70.77 (±1.21) 4 11,787 

Model 4; yi = (^^ :rij)^/''±A/'(0, d^/10) with x.j G {0,1,2} with probabilities {0.7,0.2,0.1} 

2 0.63 (±0.00) 0.63 (±0.00) 0.67 (±0.00) 8 8 

4 1.27 (±0.01) 1.27 (±0.01) 1.29 (±0.01) 9 30 

6 1.91 (±0.01) 1.91 (±0.01) 1.92 (±0.01) 4 85 

8 2.55 (±0.01) 2.56 (±0.01) 2.54 (±0.01) 5 267 

Model 5; yi = (JJ^ Xij) ± AA(0, d^) with Xij G {0, 1, 2} with probabilities {1/3, 1/3, 1/3} 

2 2.00 (±0.01) 2.00 (±0.01) 2.11 (±0.01) 4 5 

4 4.00 (±0.02) 4.00 (±0.02) 4.47 (±0.02) 6 25 

6 6.04 (±0.02) 6.04 (±0.02) 7.27 (±0.05) 87 103 

8 8.30 (±0.04) 8.30 (±0.04) 10.90 (±0.21) 67 430 



where the variables chosen are from the fohowing order: origin (American, 
European or Japanese), model year, number of cylinders, acceleration, dis- 
placement, horsepower, and weight. The order of the variables was deter- 
mined in order of the magnitude of coefficients from a least squares linear 
regression on all variables. Origin, surprisingly, had the largest magnitude, 
and in giving discrete variables 1, 2, 3 to the respective origins, there actu- 
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Table 2 

Statistics for auto mpg data. Miles-per-gallon is regressed on seven potential variables: 
origin (American, European or Japanese), model year, number of cylinders, acceleration, 

displacement, horsepower, and weight. Row k uses the first k variables from this list in 
the regression. A path of root mean squared errors for each model along the regularization 

path was computed. Bold demonstrates statistical significance of either IRP or isotonic 
regression over a least squares regression with 95% confidence, as determined by a paired 
t-test using 392 observed squared losses obtained from leave-one-out cross-validation 



Number of 


IRP min 


Isotonic regression 


LS 


IRP min 


IRP path 


variables 


RMSE 


RMSE 


RMSE 


path 


length 


1 


6.46 (±0.60) 


6.50 (±0.43) 


6.46 (±0.49) 


9 


17 
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4.91 (±0.82) 


4.95 (±0.37) 


5.24 (±0.37) 


7 


26 
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3.73 (±1.17) 


3.76 (±0.36) 


4.02 (±0.38) 
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37 
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3.92 (±0.40) 


15 


109 
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3.28 (±1.43) 


3.37 (±0.37) 


3.77 (±0.38) 


15 


114 


7 


3.29 (±1.43) 


3.36 (±0.37) 


3.37 (±0.33) 


8 


128 



ally is a monotonic trend in origin (i.e., American cars are least fuel efficient, 
followed by European cars, with the Japanese being most efficient). While 
we include origin as a variable here, we note that similar performance for 
IRP was achieved without origin in an independent experiment. 

Since the data set is rather small, we perform leave-one-out cross-validation 
(i.e., the data is divided into training and testing sets of 391 and 1 instances, 
respectively, so that each instance is used out-of-sample once). Table 2 dis- 
plays certain statistics on the IRP, and isotonic regression, performance as 
well as a comparison to the results of a least squares regression. Figure 6 
displays MSE on out-of-sample data for IRP over the regularized path of 
isotonic solutions for a regression with six variables, exemplifying that over- 
fitting occurs after 15 iterations of IRP (seen by the [/-shaped curve with 
minimum at 15 iterations). 

6. The search for gene gene interactions. As mentioned in Section 1, 
the search for gene-gene interactions (epistasis) is a major endeavor in the 
genetics community, with the goal to identify the mechanisms that connect 
genotypes and phenotypes of interest, including height and disease. 

It is generally acknowledged that the search has so far yielded very lim- 
ited actual findings of major and impactful epistasis in human phenotypes 
[Cordell (2009)]. Where significant interactions have been found, like re- 
cently by Zhang, Zhang and Liu (2011), these were often limited to close-by 
regions in the genome, where statistical and biological interactions are hard 
to differentiate (because mutation distributions are correlated due to linkage 
disequilibrium). If our interest is focused on biological interactions (e.g., two 
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Fig. 6. Root mean squared error (RMSE) for auto data with six variables using IRP 
illustrates that overfitting occurs after 15 iterations of IRP, that is, RMSE decreases un- 
til 15 partitions are made, after which RMSE begins to increase. The figure displays only 
the first 30 partitions so that the U-shape is clear. 

mutations influencing each other's causative effects on the phenotype), we 
should be particularly interested in finding epistasis between noncorrelated 
mutations. 

The limited findings in the literature may be due to two distinct reasons: 
first, the difficulty of searching through the space of all possible combina- 
tions of mutations. For example, in typical genome wide association studies 
(GWAS), one genotypes hundreds of thousands of single nucleotide polymor- 
phisms (SNPs), yielding billions of potential two-way interactions, and much 
larger numbers of higher order interactions. Heuristics for searching, like 
limiting the search to interactions between SNPs that are individually asso- 
ciated with the phenotype, may miss combinations with weak main effects 
but strong interactions. Second, the limitation to simple statistical models 
like chi-square tests and logistic regression with explicit interactions may not 
allow identifying complex high-order interactions even within the searched 
space. IRP offers a remedy to this second concern, in allowing the modeling 
of complex interactions subject to isotonicity only. As our simulations have 
shown, good performance on ternary data is expected up to dimension six 
and even beyond when a truly strong epistatic signal is present. 

To empirically examine epistasis in human disease using IRP, we ob- 
tained data from the Welcome Trust Case Control Consortium [WTCCC 
(2007)], encompassing around 2,000 cases for each of three diseases: Crohn's 
disease (CD), Type-I diabetes (TID), and Type-II diabetes (T2D). These 
are compared to 3,000 healthy controls. All of these samples were geno- 
typed for around 500,000 SNPs, and each phenotype (disease) was analyzed 
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for association between each SNP status and case-control status using chi- 
square and logistic regression. For each disease, between five and nine sig- 
nificant SNPs were discovered after careful multiple comparison corrections 
[WTCCC (2007)]. 

We discuss in detail our modeling of the CD data set. It comprises 2,000 
cases and 3,000 controls. Our covariates include nine unlinked SNPs from 
seven different chromosomes, which were discovered as significant after a mul- 
tiple comparison correction in the WTCCC GWAS and at least one other 
study [Hindorff et al. (2011)]. These are genotypes, that is, ternary vari- 
ables in {0,1,2}. This is a classification problem with binary response (i.e., 
Ui G {0, 1} for control vs case), and we thus model the risk of Crohn's disease 
with an isotonic logistic regression, rather than an I2 isotonic regression as 
we have done for continuous regressions. Specifically, we fit isotonic models 
by maximizing the in-sample logistic log-likelihood rather than the sum of 
squares: 

(11) maxi^Y^Vilog {pi) + {I -yi)log{l- Pi): Pi <pj for all Gzj. 

As explicated by Bacchetti (1989) and Auh and Sampson (2006), the solution 
to (11) is identical to the solution from solving the I2 squared loss isotonic 
regression problem (2) when the values Ui £ {0, 1}. Since we solve I2 isotonic 
regression (2) using the IRP algorithm, we can use it to solve the isotonic 
logistic regression problem. 

To evaluate isotonic model performance, we compare cross-validated area 
under the ROC curve (AUC) for each model in the IRP path to that of 
the linear logistic regression model with the same covariates that assume no 
interaction between the SNPs. We employ 150-fold cross-validation, and use 
the approach of DeLong, DeLong and Clarke-Pearson (1988) to calculate 
p-values on the holdout AUC difference. 

Figure 7 illustrates the approach on a subset of two SNPs. We can see the 
results of IRP after 2, 4 and 7 iterations (the complete path has 8 iterations). 
The model at iteration 7 gives AUC of 0.5562, compared to 0.5501 for logistic 
regression, yielding a p-value of 0.0282 (which is only mildly indicative of 
a possible interaction given multiple comparison issues, as we hand-picked 
the two mutations and the number of iterations). The fit at iteration 7 
seems to support a super-additive interaction: presence of two copies each 
from both SNPs confers a jump in CD risk. 

We applied IRP on all subsets of three and four SNPs, and also the entire 
set of nine SNPs. Several models yielded AUC improvements over logistic 
regression which were significant at the nominal 0.05 level (like the two- 
variable model above), but none would withstand a multiple comparison 
correction. With all SNPs, the full path length was 96 iterations. The mini- 
mum model AUC for isotonic logistic regression was 0.6457 at 55 iterations 
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Fig. 7. Illustration of IRP with logistic log-likelihood loss function for modeling Crohn's 
disease, as a function of two SNPs: rsll805303 on chromosome 1 and rsl02120302 on 
chromosome 2. Models after iterations 2, 4 md 7 of IRP are shown. The model at itera- 
tion 7 gives the best predictive performance along the path. 

and for logistic regression was 0.6449. Because AUG was highly variable 
between folds for this high-dimensional model, this difference was insignifi- 
cant: p- value was 0.4. The final model had an AUG of 0.6456 and the p- value 
was 0.42. 

Our results on the TID and T2D data sets were qualitatively similar: no 
high order interactions of significance were identified by IRP. As the IRP 
approach possesses the flexibility and power to identify such interactions if 
they are strong enough (as illustrated in our simulations), we can conclude 
that the strong univariate signals identified by the WTCGC studies do not 
yield significant higher order interactions. This is of course in line with pre- 
vious findings [Gordell (2009), Emily et al. (2009)], but now also confirmed 
by our more sophisticated and flexible approach. 

7. Discussion and extensions. The IRP algorithm offers solutions to both 
the statistical and computational difficulties of isotonic regression. Algorith- 
mically, IRP solves (2) as a sequence of easier binary partitioning problems 
that are efficiently solved using network flow algorithms. From the statistical 
perspective, IRP generates a path of isotonic models, each defining a par- 
titioning of the space X into isotonic regions. The averages of observations 
in these regions comply with the isotonicity constraints (Theorem 4). In 
this view, IRP provides isotonic solutions along its path that are regularized 
versions of the globally optimal isotonic regression solution. 

Our discussion so far has focused on using the sum of squares loss function 
in (2) for fitting "standard" isotonic regression subject to squared error loss 
as well the logistic log-likelihood which we noted is an identical problem. 
A well-known result of Barlow and Brunk (1972) implies that the solution 
of a whole variety of loss functions subject to isotonicity constraints can 
be obtained by solving standard isotonic regression, as long as the loss can 
be written as minimizing X^"^;^ Wi{'^{zi) — XiZi)^ in z G R" for some convex 
differentiable ^ and some data-dependent values x and weights w. These re- 
sults imply that many other loss functions subject to isotonicity constraints 



26 



R. LUSS, S. ROSSET AND M. SHAHAR 



can optimally be solved by IRP via a reformulation to a problem of the 
form (2). Barlow and Brunk (1972) note that such transformations can be 
applied to many maximum likelihood estimation problems (subject to iso- 
tonicity constraints), including Bernoulli (as described in Section 6), multi- 
nomial, poisson and gamma distributed problems. We plan to investigate 
the applicability of the resulting regularization algorithms in future work. 

We can now also formalize the connection between IRP and the well- 
known work of Maxwell and Muckstadt (1985) [and similarly Roundy (1986)] 
that was mentioned in the Introduction. Maxwell and Muckstadt (1985) 
solved an operations research problem (related to scheduling reorder inter- 
vals for a production system) by reducing it to the optimization of a convex 
objective subject to isotonicity constraints. In our notation, their objective 
(i.e., loss function) is '^^^lici/iji + biyi), where q, hi are data-dependent 
nonnegative constants determined by their problem formulation. To apply 
the theory of Barlow and Brunk (1972), we reformulate their problem as 
minimizing Yll=i^i{^i ~ in z € R", that is, a standard weighted 

isotonic regression, and recovering = \J —z* (note that the isotonic regres- 
sion fits nonpositive observations —bi/ci). Indeed, the algorithm of Maxwell 
and Muckstadt (1985) is completely equivalent to applying IRP on this mod- 
ified problem! It should be emphasized, however, that Maxwell and Muck- 
stadt (1985) were interested in this algorithm purely as a means to reach the 
optimal solution, and were uninterested in statistical considerations which 
led us to consider intermediate IRP solutions as regularized isotonic models 
of independent interest. Spouge, Wan and Wilbur (2003) also used Maxwell 
and Muckstadt (1985) to inspire the partitioning algorithm for the standard 
isotonic regression problem, however, they do not make the connection using 
Barlow and Brunk (1972), and also have no statistical interests in mind. 

As our analysis and experiments have demonstrated, computation is not 
a significant concern with IRP, at least for moderate to large data sizes. 
However, overfitting is still a major concern as dimensionality grows. As 
demonstrated in Sections 4 and 5, while IRP offers partial protection from 
overfitting through its regularization behavior, even the first step in the IRP 
path could already suffer from high variance in a dimension as low as six. 
A key question pertains to identifying factors affecting this overfitting behav- 
ior, specifically characterizing situations in which the initial IRP iterations 
are less prone to overfitting. 

APPENDIX 

Proof of Proposition 1. We first rewrite both the IRP partition 
problem (3) and the maximal between-group variance partition problem (4). 
Assume V = A\J B and An B = {•}. Then it is easy to show = 
I^I^A + \B\yB which gives [y^ - Vv) = -\B\{yB - The objective 

function to (3) can be written \B\(jj^ ~yv) ~ \M^A ~yv) ^^'^ using the 
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previous relationship can again be rewritten 2|i?|(y^ — yy). An obvious 
property of the optimal IRP cut is that >yy- If we add this as a redun- 
dant constraint to the IRP partition (3), then we can find the same opti- 
mal partition by maximizing the square of the objective, that is, maximize 
4:\B\^(jjQ — yy)'^ subject to the appropriate constraints. The objective of the 
between-group variance partition (4) can be rewritten using the above rela- 
tionship as (|-B| -|- — yy)"^. Then denoting the IRP and maximal 
between-group variance objectives by g*{A,B) and g{A,B), respectively, we 
have g*{A,B) = 4:\A\\B\g[A,B) /n since \A\ + \B\ = n is constant. Eliminat- 
ing the constant 4/n gives the first result. 

In order to prove the second statement, notice that optimality of (3) 
and (4) gives \A*\\B*\g{A*^Bl) > \A*\\B*\g{A, B) and g{A,B) >g{A*,B*) 
which implies > This along with the relation 

{\A*\ + \B*\f = \A*\^ + 2\A*\\B*\ + \B*\'^ = \A\'^ + 2\A\\B\ + \B\'^ = {\A\ + \B\f 

gives + < |Ap + We use this to get the relation 

(1^*1 - \B*\f = - 2\A*\\B*\ + 

< - 2\A*\\B*\ + |Sp < |ip - 2\A\\B\ + |Sp 

= (1^1-1^1)', 
which gives the second result of the proposition. □ 

Proof of Theorem 2. Divide the blocks in V into three subsets: 

(1) C: union of all blocks in V that are "below" the algorithm cut. 

(2) U: union of all blocks in V that are "above" the algorithm cut. 

(3) M: union of K blocks in V that get broken by the cut (note that 
blocks in ^A may be separated by blocks in C or U). 

Define Mi {Mk) to be the minorant (majorant) block in M. For each 
define {M}!) as the groups in below (above) the algorithm cut. 
Define A^ <Z C (Af C U) as the union of blocks along the algorithm cut 
such that A'^ >~ Mh {A^ -< ). Refer to Figure 8 for an example of these 
definitions where AJ/ = A^ = A^ = A^ = {■} for simplicity. 

We use the above definitions and assumptions to state the following two 
consequences that cause a contradiction: 

(I) Vmi ^VMk optimality (i.e., according to KKT conditions) and 
isotonicity. 

(II) yj^,[^ > yy and yj^j^^ < yy. This is proven below. 

(II) implies y^j^ > IJmk '^^^ich contradicts (I) and we are left to prove (II) . 
Optimality of blocks Mi and Mk gives: 

(a) l/Mf >yMU, 

(b) Vm^ >yMU- 
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Fig. 8. Illustration of the proof of Theorem 2. Black lines separate blocks. The diagonal 
red line through the center demonstrates a cut of Algorithm 1. C is the union of blue blocks 
below the cut and U is the union of green blocks above the cut. White blocks are blocks 
that are potentially split by Algorithm 1. These blocks are split into Mi , . . . , below 
the cut and , . . . , above the cut. In the proof Mi = M/" U M^ for alii ^ 1, . . . ,5. 
The proof shows, for example, that if the algorithm splits Mi into Mi and Mi according 
to the defined cut in (5), then there must be no isotonicity violation when creating blocks 
from Ml and MY . However, since Mi is assumed to be a block, there must exist an 
isotonicity violation between Mi and Mi , providing a contradiction. 



The proof for yj^[_^ > t/y is as follows with two cases: 

(1) = {•}: Vjxju > yy because using the algorithm cut in (5), we have 

The first inequality is true about the cut because there exists no block 
below Ml to affect isotonicity. Then using (a), we get 

vm^ > yhju >yv =^ vmi > vv- 

(2) 7^ {•}: Vmi > Va^ > Vv- The first inequality is due to optimality 
and the second is again because the algorithm cut in (5) gives 

Y iyi-yv)>^ =^ Y yi>\'^i\yv =^ yAY>yv^ 

which again is possible because no block exists below A^ to affect isotonicity. 

The proof for y^v/^ < Vv ^ similar argument and hence gives (II) . The case 
-ftT = 1 is also trivially covered by the above arguments. We conclude that 
the algorithm cannot cut any block. □ 

The following remark is necessary for completeness of the proof of Theo- 
rem 2. 
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Remark 6. The case of two connected optimal groups having equal 
means need not be discussed in Theorem 2. In this event, the optimal so- 
lution to isotonic regression in not unique. It is trivial that Mi would not 
have been split by Algorithm 1 if yjijL = VnfU 7^ yy- Otherwise, consider the 
case y]\jL = y^l' — Vv assume Mi is a block broken by the cut in V. 
Ml and M^ are also possible blocks whereby M^ G C and A^Y ^ ^) and, 
hence, Mi = M^ U Mf ^ M. The same remarks apply to Mk- Thus, the 
proof still holds if there are multiple isotonic solutions. 

Remark 7. The case of multiple observations at the same coordinates 
can be disregarded. To see this, let J be a set of nodes with the same 
coordinates. From the constraints, yi = yj, for all i,jEJ and, thus, the 
number of observations can be reduced and all observations in J fit to the 
same value y. Then 



so that the sum of squared differences over J can be reduced to be a sin- 
gle weighted squared difference. Problem (2) becomes the weighted isotonic 
regression problem 



for which the KKT conditions imply that observations are again divided 
into k groups where the fits in each group take the weighted group mean 
Vv = ^i:xiev ('^iyi)/^i:x^ev''^i rather than the group mean. The optimal 
cut problem (5) changes to have Zi = Wi{yi — yy) and the above results on 
IRP generalize easily, noting that now the weighted algorithm cut implies 
Va ^ Vv ^^'^ ^ gi^oup A on the upper side of the cut such that no group exists 
below A that could affect isotonicity. 

Proof of Proposition 5. Any final partition can be represented by 
a simple tree. Consider level k of the tree. Let > 0.5 be the greatest p over 
levels 1, . . . , A: — 1 such that a partition of group size into two groups of 
size puk and {l—p)rn^ where n^. is the corresponding size of the partitioned 
group. Denote by Lk the largest group partitioned at iteration k whose size 
can be bounded by |Lfc| < np|. We next note that the complexity of solving 
a problem with n observations is higher than solving 2 problems with pn 
and (1 —p)n observations. Indeed, =pn^ + (1 —p)n^ >p'^n^ + (1 — p)^n^. 
Thus, we assume that at iteration k, we solve only problems of the largest 
possible size (rather than several problems of small size). The number of 
groups at iteration k can also be bounded by n/\Lii.\. Denote by Tp^{k) the 



{y-y,f = \J\\{y-yjf+ ^ y" 




(12) 
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complexity of partitioning all groups at level k. Then 

T,,{k) < O^^lLp) = 0{n\L\') < 0{n{nplf) = 0{n')pf . 
Then denote by K the total number of levels in the partition tree. We have 

K K oo 

k=i k=i k=i 
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